Abstract

This paper introduces a new mechanism for specifying constraints in
distributed workflows. By expressing constraints in a contextual form,
it shows how different people and groups within collaborative communities
can cooperatively constrain workflows. A comparison with existing
state-of-the-art workflow systems is made. These ideas are explored in
practice with an illustrative example from High Energy Physics.


1 Introduction 

The Grid is emerging as a specialized distributed computation standard of
unprecedented power and scope, promising to turn commodity networks and
computers into commodity computation. The Grid concept has already proven
useful for science in many applications [?]. Substantial infrastructure already
exists [3] or is being planned [?]. Data processing on the Grid ranges from
tightly coupled problems to loosely coupled (so-called "embarrassingly
parallel") ones. Computation that uses MPI [?] or some other
parallel processing standard within a single application is an example of the
former kind. The High Energy Physics (HEP) problem domain is characterized
by the latter: parallel independent filtering or data processing applications
with large data flows between them. Requirements placed on large processing
projects are often coordinated among many interests and groups within a
Virtual Organization (VO). Even for workflows containing a moderate number
of application steps, it becomes a daunting task to check that all of the input
parameters and installed software configurations conform to decisions made at
the collaboration level. It is therefore useful to have a mechanism with which
it is possible to specify constraints on the parameters of the individual workflow
steps that bring them into line with collaborative decisions coherently across
the entire workflow, possibly even dynamically as decisions or discoveries are
being made. In this paper, it will be shown that:

• Collections of constraints can be gathered into documents called contexts
that function as operators on existing workflow graphs. An algebra of
contexts supporting factoring and composition can help different groups
within a VO work together through constraint sharing. Decomposition of
contexts can allow for variance of constraints simultaneously across several
different categories.

• Constraint expressions and contexts form an interesting and hitherto largely
unexplored area of data provenance. Knowledge of the constraints implies
that it is possible to know not only the values of application input
parameters, but also why they were set in particular ways.

It can be assumed that such constraints can be distributed within the stack of a
single running application using MPI. However, the techniques developed here
find fruitful application in the organizational aspects and in the aspects of
sharing collaborative decisions about constraints in the multi-application domain. A
partial implementation of these ideas already exists in workflow building tools
[?] developed for the Compact Muon Solenoid (CMS) [?] experiment, a HEP
experiment based at the European Organization for Nuclear Research (CERN) [?] in
Geneva, Switzerland.

In the following, we will focus mainly on semantic constraints and not on
constraints on physical resources, synchronization, or parallelization. Many
traditional workflow specification schemes such as DAGMan [?] and constraint
languages such as ClassAds [?] already address these concerns. In sections 2
and 3 it will be shown how multigraphs, including new arrow types called
metadata flows, can be used to express constraints. It will also be shown how these
constraints can be expressed in cooperating context documents, and a general
procedure for reducing multigraphs into fully constrained workflow descriptions
suitable for execution by a workflow manager such as DAGMan will be
outlined. Section 4 will illustrate the concepts introduced in previous sections with
an example from High Energy Physics. Finally, the discussion will
be wrapped up and comparisons with existing workflow systems will be made
in the conclusion.
2 Objects and Operations in a Workflow Constraint Specification

Workflow specifications are often expressed as directed graphs. In many cases
where the workflow consists of pure filtering and/or simple single-pass
processing, these graphs are directed acyclic graphs (DAGs). Let G = (N, A) be a
general unconstrained workflow graph. Each node in G corresponds to an
application, and the set of arrows A corresponds to a partial sequencing of the
nodes, usually generated by real data flow relationships.¹ Each of these nodes
may have attributes specifying input parameters to or conditions on the
corresponding application. In order to express constraints on G, one may add both
nodes and arrows. The resulting data structure is a multigraph.² The extra
arrows correspond to constraint relationships between specific node attributes.
These extra arrows will be called metadata flows, and the set of all metadata flow
arrows constraining a graph G will be denoted by F. In addition to metadata
flows, special nodes may be added whose only purpose is to serve as sources or
sinks for metadata flows. These extra nodes are called metadata terminals, and
the set of these will be denoted M. This is illustrated in figure 1. An example
of a metadata source terminal is a null workflow node that holds a query result
from some catalog. An example of a metadata sink is a node that consumes
metadata merely to record it, such as a tracking system or provenance recorder.
In general, metadata sources and sinks may be replaced by a single source and
a single sink node, but we may also allow for many. The multigraph containing
G and its constraints expressed by F and M will be denoted Mq. This is the
constrained and unreduced workflow graph. Several types of arrows are evident
in Mq, including conventional workflow sequencing arrows and metadata flows.
For a summary of all of the symbols being introduced, please refer to table 1.

Operations on the unreduced constrained graph consist of reduction
operations that satisfy the metadata flows and thereby gradually reduce Mq to a
constrained and reduced G', that is, a workflow graph with zero metadata flows. Let
F(I.i, J) denote a metadata flow, where I is a node in N, J is a node in
N + M, and i is an attribute of I. The first argument is the target of the
constraint while the second argument is the domain. A reduction Rp over F is an
operation that replaces the value of attribute I.i by some value which satisfies
the constraint computed from the domain J. For example, the simplest such
operation is just the assignment reduction =p. The constraint I.i =p J.j on
I.i is that it be equal to J.j, where j is some attribute of J.³ Categorically,
the unifying character of this picture can be seen by considering that general
constraint reductions targeting an attribute value in some workflow node are

1 Alternatively, the nodes may correspond to data products and the arrows may correspond
to data transformations, as in the Chimera Virtual Data System [?]. These two pictures are
equivalent.

2 A graph comprises some set N of nodes and a set A of arrows such that for any two nodes 
N and M in N, N and M can have at most one arrow a in A between them. In a multigraph, 
the restriction on the number of arrows is dropped. 

3 Or, more generally, j can be a simple expression involving attributes of J. 


Figure 1: (A) A simple four application acyclic workflow graph. In order to 
execute the graph, all node attributes must be set. (B) This simple graph is 
augmented with metadata flows (pink) specifying various constraints on the 
node attributes. In addition a metadata source node has been added. The 
constraint augmented workflow is now a multigraph. 
Graph   Name                              Nodes    Arrows
G       Workflow Graph                    N        A
F       Metadata Flows                    -        F
M       Metadata Terminals                M        -
C       Context                           M        F
Mq      Constrained Unreduced Workflow    M + N    A + F
M'q     Metadata Flow Subgraph            M + N    F
G'      Constrained and Fully Reduced     N        A

Table 1: Summary of the different graph entities introduced in
section 2. (Context will be defined in section 3.)


equivalent to constant assignment reductions emanating from a single metadata 
terminal node in 

Let M'q be the metadata flow subgraph of Mq such that M'q contains all
of the nodes of Mq but only the arrows in F. Note that this
subgraph must be acyclic so that at least one serialization of the metadata
flows F exists. Otherwise the constraint model is undefined. Metadata flows
may exist to carry metadata to a finite number of graph nodes in a possibly
cyclic G, and each node in G has a finite number of attributes. Thus it is always
possible to find a finite spanning tree if there is no more than one metadata
flow per target attribute.
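
The acyclicity requirement can be checked mechanically. The following is a minimal sketch, assuming each metadata flow is recorded as a (target node, domain node) pair; the function name and the node names are illustrative and are not taken from any existing tool.

```python
from graphlib import TopologicalSorter, CycleError


def serialize_flows(flows):
    """Return node names so that every domain precedes the targets it feeds.

    flows is a list of (target_node, domain_node) pairs (an assumed encoding).
    Raises ValueError when the metadata flow subgraph contains a cycle.
    """
    sorter = TopologicalSorter()
    for target_node, domain_node in flows:
        sorter.add(target_node, domain_node)  # the target depends on the domain
    try:
        return list(sorter.static_order())
    except CycleError as err:
        raise ValueError(f"constraint model undefined, cycle: {err.args[1]}") from err


# Hypothetical chain: a catalog terminal feeds three application nodes.
flows = [("Simulator", "Generator"),
         ("Digitizer", "Simulator"),
         ("Generator", "Catalog")]
order = serialize_flows(flows)
```

Any order returned this way is one valid serialization of F; when several exist, the sorter simply picks one of them.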

Reduction always results in the removal of a single arrow from F and possibly
the alteration of at most one attribute in the target node of the flow. The order
of reduction is determined by a partial ordering of the nodes in Mq. Instead of
altering an attribute, a reduction may check a boolean constraint and raise an
exception if it evaluates to false. Reduction can continue until Mq has been
transformed into G'. In general, there are many such reduction partial orders.
Some optimization may be gained by grouping reduction operations together if it
is known in advance that they can be reduced together.
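
As an illustration of the procedure, the sketch below reduces a set of assignment flows in one valid order. It assumes that nodes are plain dictionaries of attributes and that a flow is a 4-tuple (I, i, J, j) meaning "I.i must equal J.j"; these encodings and all names are hypothetical.

```python
from graphlib import TopologicalSorter


def reduce_assignments(nodes, flows):
    """Apply every assignment flow and return the reduced node attributes.

    nodes: {node name: {attribute: value}}
    flows: list of (I, i, J, j) tuples, each meaning I.i = J.j
    """
    reduced = {name: dict(attrs) for name, attrs in nodes.items()}
    sorter = TopologicalSorter({name: set() for name in nodes})
    for target, _, domain, _ in flows:
        sorter.add(target, domain)            # domains are reduced first
    pending = list(flows)
    for name in sorter.static_order():
        for flow in [f for f in pending if f[0] == name]:
            target, attr, domain, source_attr = flow
            reduced[target][attr] = reduced[domain][source_attr]
            pending.remove(flow)              # one arrow removed per reduction
    return reduced


nodes = {
    "Catalog": {"run": 42},                   # metadata source terminal
    "Generator": {"run": None, "out": "gen.dat"},
    "Simulator": {"run": None, "inp": None},
}
flows = [
    ("Simulator", "run", "Generator", "run"),
    ("Generator", "run", "Catalog", "run"),
    ("Simulator", "inp", "Generator", "out"),
]
reduced = reduce_assignments(nodes, flows)
```

The node-level ordering is what lets the value of Catalog.run propagate through Generator to Simulator even though the flows were listed in a different order.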

3 Constraints and Contexts 

It is useful at this point to introduce the context C, consisting of the metadata 
terminals alone and the metadata flows. By partitioning Mq into a context part 
C and an application workflow part G, we gain the possibility that a single set of 
constraints agreed upon by some large organization can be applied to multiple 
application workflow graphs, and that a given application workflow graph can be 
run in a variety of contexts. Furthermore, we gain the possibility of factoring the 
context C itself into parts that are of interest to specific individuals or groups. 

Given the wide variety of workflow graphs and contexts possible, it is a 
non-trivial task to design an algebra by which these sets can be combined in a 
simple way. A further problem is that C by itself is not even a well-defined graph 
because some of the arrows in C point to nodes that are not in C. The approach 
taken here to deal with these problems is to assign types to the graph nodes in
the application workflow graph. By surveying the set of all possible application
nodes in an organization, it is possible to come up with a context document Dq
that, rather than being a graph subset, is a collection of rules for how to apply
metadata terminals and metadata flows in a real application workflow graph as
workflow nodes are added into the context document. The rules are indexed
by node type and may be applied under one of two kinds of semantics:
"only-once" semantics or "for-each" semantics. For metadata terminals, the only-once
semantics are generally used. For metadata flows, the for-each semantics are
generally used. This is illustrated in figure 2. This approach works for a large
number of cases.

Figure 2: An illustration of a context document. The type of the application
workflow node (added in the center) is used to look up rules for adding metadata
flows and/or metadata terminal nodes. Generally, flows are added with
"for-each" semantics and nodes are added with "only-once" semantics. (Figure
label: "Metadata Flow Added FOR EACH Application Node of Type X".)
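
The two kinds of semantics can be sketched as follows. This is only an illustration under assumed data layouts: the ContextDocument class, its rule tables, and the DetectorDB terminal are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ContextDocument:
    # node type -> list of (target attribute, source terminal, source attribute)
    flow_rules: dict = field(default_factory=dict)
    # node type -> list of metadata terminal names needed by that type
    terminal_rules: dict = field(default_factory=dict)


def apply_context(doc, workflow_nodes):
    """workflow_nodes is a list of (node name, node type) pairs.

    Returns the metadata terminals and metadata flows the rules produce.
    """
    terminals, flows = [], []
    for name, node_type in workflow_nodes:
        for terminal in doc.terminal_rules.get(node_type, []):
            if terminal not in terminals:     # only-once semantics
                terminals.append(terminal)
        for attr, source, source_attr in doc.flow_rules.get(node_type, []):
            flows.append((name, attr, source, source_attr))  # for-each semantics
    return terminals, flows


doc = ContextDocument(
    flow_rules={"simulator": [("geometry", "DetectorDB", "geometry")]},
    terminal_rules={"simulator": ["DetectorDB"]},
)
terminals, flows = apply_context(
    doc, [("sim1", "simulator"), ("sim2", "simulator"), ("gen", "generator")]
)
```

Two simulator nodes yield two flows but a single DetectorDB terminal, and the generator node, which has no rules, is left untouched.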

An algebra for combining different context documents is being developed.
This is somewhat more difficult to do in complete generality, not only because the
documents are not graphs, but also because rules must be developed to handle
metadata collisions when metadata flows share the same target. In general, the
final metadata flow that gets applied should not depend upon
the order in which the context documents are processed. Even in the absence
of a full collision resolution algorithm, there is still a large class of problems for
which the metadata flows do not collide and are nevertheless useful. There is also another
set of problems for which the "shadowing" behavior of overlapping metadata
flows is actually desirable, such as for simple replacement of site-dependent
variables with a site-independent default.
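
Both situations can be sketched briefly. The code below is a hypothetical illustration in which each context contributes a list of (target node, target attribute, source) triples; it is not the collision resolution algorithm under development.

```python
def find_collisions(*flow_sets):
    """Return the targets that receive flows from more than one context."""
    owners = {}
    for index, flows in enumerate(flow_sets):
        for node, attr, _source in flows:
            owners.setdefault((node, attr), set()).add(index)
    return sorted(target for target, who in owners.items() if len(who) > 1)


def shadow(*flow_sets):
    """Merge flow sets; a later set overrides an earlier one on the same target.

    The result depends on argument order, which is exactly the shadowing behavior.
    """
    merged = {}
    for flows in flow_sets:
        for node, attr, source in flows:
            merged[(node, attr)] = source
    return merged


# Hypothetical contexts: one general, one overriding the scratch directory.
general = [("Simulator", "scratch_dir", "/tmp"), ("Simulator", "seed", "RefDB")]
override = [("Simulator", "scratch_dir", "/data/scratch")]

collisions = find_collisions(general, override)
merged = shadow(general, override)
```

Running the collision check first makes it explicit which targets are order dependent before any shadowing is accepted.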
4 Applications

Many of the ideas presented in this paper have been implemented in the area 
of data processing for HEP. Two related software toolkits have been developed 
for that application and will be discussed here: 

• The RunJob Project at Fermilab [7] provides basic entities to help
define, configure, and execute a workflow. RunJob has metadata terminals,
metadata flows, a reduction algorithm, and context documents.

• MCRunjob is a package to create Monte Carlo simulation jobs for the
CMS [?] experiment. MCRunjob is based upon the software provided by
the RunJob Project.

In the following, the problem of offline data processing and analysis in HEP 
will be described in very broad terms. Then the basic entities provided by the 
RunJob Project will be described, and a concrete example of the contextual 
application of constraints will be described. 

4.1 High Energy Physics Data Analysis 

The volume of data from the Large Hadron Collider (LHC) experiments
at CERN [?] is expected to be very large, on the order of several
petabytes per year. Distributed processing of this data is expected to
be the norm rather than the exception, both because of the data volume and
because collaborators on these experiments are expected to make
significant contributions to LHC analysis without traveling to CERN for
extended periods of time. New tools and frameworks must be brought to bear
to help organize resources towards the successful processing and analysis of
data under these circumstances, and this is what generates the interest of HEP
in Grid technology.

The successful analysis of HEP data also involves the production and analysis
of large amounts of simulated "Monte Carlo" data in order to understand
detector responses and biases that could affect a measurement. The amount
of Monte Carlo data produced will be of the same order of size as the actual
data collected. The production of Monte Carlo data usually involves a sequence of
application programs. The following are just three broad examples.

• Generator program: This program randomly generates a pure physics 
event taking into account theoretical probabilities for the production of 
subatomic particles and fast decay products associated with a collision 
event. 

• Simulator program: This program steps each particle created by the
generator through a precise model of the collider detector, and calculates the
amount of energy deposited in each defined detector volume. An LHC
generated event may contain dozens of individual particles to track, but
the collider detector may contain hundreds of thousands to millions of
individually modeled detector volumes.

• Digitizer program: An "active" detector volume is one which is
instrumented in the corresponding real detector. For each active detector
volume, this program calculates a digitized electronic signal taking into
account what is known at the time about the electronics.

The perennial problem in the area of distributed Monte Carlo production is to 
constrain the applications so that the data produced at one site is of the same 
quality as data produced at any other site. 

User analysis is generally approached from the standpoint of how to gain
individual access to distributed resources and how to move user-written analysis
software to the job execution site. However, here we are more concerned with
the organizational principles of coordinating the processing of data across different
user jobs. One possible way to think about this problem is as a constraint
problem. Each application that
the user submits may have dozens to hundreds of parameters to set, and the
user will not in general be able to set them all by hand. Often, it is the physics
group that provides help in setting all of the "expert" parameters. In other
words, the physics group provides a context within which the user must analyze
data. This is especially important if the user belongs to more than one physics
group. The parameters set in this way might include the following.

• Detector parameters that only experts know how to set, for example
thresholds on the cells in the electromagnetic calorimeter. These
parameters control whether or not an individual detector component will write
data into the output stream (i.e., only if the simulated signal is above
threshold).

• Choice of algorithm to use for the reconstruction of a physics object, for
example jet clustering, and parameters such as a cone size for that
algorithm. Such parameters are often set by algorithm specialists and are
seldom set by individual users unless they are doing a study.

• Choice of input dataset. Modern data management systems such as the 
CMS data management system will have the ability to select datasets by 
a query on the physics metadata. The queries may differ from group to 
group. 

• User written software. This could contain special user parameters or the 
specification of special libraries. 

In many cases, it is important when comparing two different analyses that such
parameters as outlined above are known and controlled for. Other
site-dependent parameters best set by an administrator include the following.

• locating the input data

• correct placement of the output data

• locating the experiment application software and libraries

• accessing local batch or Grid resources

• configuring application parameters governing simulation

4.2 The RunJob Project 

The RunJob Project was initiated at Fermilab as a common project between 
the DZero and the CMS experiments to combine their respective tools used 
in creating jobs for production of Monte Carlo data. The software produced 
contains the following elements: 

• Modular configuration. Workflow elements consist of modular
descriptions of applications and of their dependencies upon other application
descriptions.

• Context driven workflow. 

• Simple framework model of workflow execution. 

• Reduction algorithm for reducing metadata flows. 

A simple diagram illustrating RunJob appears in figure 3.

4.2.1 Modular Configuration 

Workflow elements in RunJob contain key/value parameters that configure the
application at hand. These parameters can be used to run the application
directly or to create a job that will run the application later. Workflow
elements in RunJob are also given a complementary key/value pair description.
This description is much like a ClassAd [?] in Condor. A given workflow
element has dependencies expressed in terms of the descriptions of other workflow
elements. This is used to determine the execution order of the workflow elements
or of the jobs they create if they are making jobs.

Finally, a parameter in a workflow element may be a reference to a
parameter in another workflow element. The reference specifies the description
of the other workflow element and the name of the parameter within that
workflow element.
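
The structure just described can be pictured with a small sketch. The classes and functions below are illustrative assumptions and do not reproduce the RunJob class interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Ref:
    description: dict       # key/value pairs identifying the referenced element
    parameter: str          # name of the parameter inside that element


@dataclass
class WorkflowElement:
    description: dict
    parameters: dict = field(default_factory=dict)
    dependencies: list = field(default_factory=list)  # descriptions of other elements


def matches(element, description):
    """True when the element's description contains every given key/value pair."""
    return all(element.description.get(k) == v for k, v in description.items())


def resolve(elements, element, name):
    """Follow references until a plain value is found."""
    value = element.parameters[name]
    while isinstance(value, Ref):
        owner = next(e for e in elements if matches(e, value.description))
        value = owner.parameters[value.parameter]
    return value


gen = WorkflowElement({"type": "generator"}, {"output_file": "events.ntpl"})
sim = WorkflowElement(
    {"type": "simulator"},
    {"input_file": Ref({"type": "generator"}, "output_file")},
    dependencies=[{"type": "generator"}],
)
input_file = resolve([gen, sim], sim, "input_file")
```

Because references name a description rather than a specific object, the same simulator configuration works with whichever generator element happens to be present.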

4.2.2 Context Driven Workflow 

In the RunJob software, a context document can be given that expresses simple 
constraints on the parameters known to the workflow elements. Each block of 
the context consists of the following information: 

• A header consisting of a workflow element description 

• A body consisting of a list of directives that can either constrain workflow
parameters in individual workflow elements, define references between
parameters of two different workflow elements, or redefine existing references
between two different workflow elements.
[Figure 3 labels: Linker; WE1, WE2, WE3; Framework Call A, Framework Call B,
Framework Call C]

Figure 3: (A) The simplified core class structure of RunJob. Workflow
elements consist of dictionaries that describe each application. They in turn have
descriptions consisting of key/value pairs and lists of dependencies which
reference descriptions of other workflow elements. A workflow element can reduce
a metadata flow using the GetValue method, which references a metadata element
in another workflow element by specifying elements from its description
dictionary. The workflow elements are constrained by an entity called the Linker,
which also issues framework messages to the workflow elements. Each workflow
element may have a handler registered to handle the framework message or task. (B)
An interaction diagram showing the pattern of framework messages for three
workflow elements (assumed to be in dependency order) and three framework
messages or tasks. Workflow elements may handle the framework call or not.
The context is loaded into the system prior to any workflow elements. When
a workflow element is added to the system, the directives within any context
block whose header matches the description of the added workflow element are
applied to that workflow element. In the RunJob software the match must be
exact in the sense that all key/value elements in the context block header must
match the workflow element description.⁴ The workflow element description
may have more key elements than the context block header in a match.

Different context documents may be loaded into the system simultaneously.
Currently, the RunJob system does not implement any conflict resolution
algorithm on similar context blocks emanating from different documents. Rather,
the document specified last wins any conflict.
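
The matching rule and the last-wins behavior can be sketched together. The System class below is hypothetical, directives are modeled as plain parameter assignments, and the wildcard relaxation mentioned in the footnote is not modeled.

```python
def header_matches(header, description):
    """True when every key/value pair of the block header appears in the description.

    The description is allowed to carry additional keys.
    """
    return all(key in description and description[key] == value
               for key, value in header.items())


class System:
    def __init__(self, context_blocks):
        # (header, directives) pairs, loaded before any workflow element
        self.blocks = list(context_blocks)
        self.elements = []

    def add_element(self, description, parameters):
        params = dict(parameters)
        for header, directives in self.blocks:
            if header_matches(header, description):
                params.update(directives)     # later blocks win any conflict
        self.elements.append((description, params))
        return params


blocks = [
    ({"type": "simulator"}, {"geometry": "default"}),
    ({"type": "simulator", "site": "CERN"}, {"scratch": "/pool"}),
    ({"type": "simulator"}, {"geometry": "v2"}),
]
system = System(blocks)
params = system.add_element(
    {"type": "simulator", "site": "FNAL", "version": "3"}, {"events": 100}
)
```

The second block is skipped because its header asks for a site that the element does not have, while the third block overrides the first on the shared geometry parameter.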

4.2.3 Framework Model of Workflow 

In the RunJob software, the workflow is executed in framework fashion. The
user can define a set of framework tasks, each framework task being represented
by a string. After all workflow elements have been added to the system, it will
issue each framework task in the form of a message to each workflow element.
The order of the workflow elements is determined by their dependencies.⁵ The
workflow elements may or may not respond to specific framework calls depending
on how they are defined and configured.
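
A minimal sketch of this message pattern is given below. The Element and Linker classes are stand-ins written for illustration; they assume every dependency names an element that has been added, and they do not reproduce the RunJob interfaces.

```python
from graphlib import TopologicalSorter


class Element:
    def __init__(self, name, depends_on=(), handlers=None):
        self.name = name
        self.depends_on = tuple(depends_on)
        self.handlers = handlers or {}        # task name -> callable


class Linker:
    def __init__(self):
        self.elements = {}

    def add(self, element):
        self.elements[element.name] = element

    def run(self, tasks):
        """Issue each task, in turn, to every element in dependency order."""
        graph = {name: e.depends_on for name, e in self.elements.items()}
        order = list(TopologicalSorter(graph).static_order())
        log = []
        for task in tasks:
            for name in order:
                handler = self.elements[name].handlers.get(task)
                if handler is not None:       # an element may ignore a task
                    log.append(handler(name))
        return log


def report(task):
    """Build a trivial handler that records which element handled which task."""
    return lambda name: f"{name}:{task}"


linker = Linker()
linker.add(Element("sim", depends_on=["gen"], handlers={"makeJob": report("makeJob")}))
linker.add(Element("gen", handlers={"configureJob": report("configureJob"),
                                    "makeJob": report("makeJob")}))
log = linker.run(["configureJob", "makeJob"])
```

Although the simulator element was added first, its dependency on the generator places it second for every task.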

This arrangement is logically equivalent to a DAG in the case that there 
is one framework task and the nodes are laid out in dependency order. The 
DAG has the advantage that in the case of a node failure, it may proceed to 
process independent nodes. A comparison of the methods of executing workflow 
elements is beyond the scope of this paper. Many other workflow systems do in 
fact use DAGMan, developed at the University of Wisconsin at Madison, as a 
workflow execution manager. 

4.2.4 Reduction Algorithm 

Reduction in RunJob is accomplished by lazy evaluation of metadata flows. 
The metadata flow is evaluated by inter-workflow element lookup only when 
the target of the metadata flow is accessed by some other entity or metadata 
flow. This is accomplished by setting read triggers on the attributes of RunJob 
workflow elements. 
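
One way to picture a read trigger is attribute interception. The sketch below is an assumption about how lazy reduction could be written in Python and is not the RunJob implementation; it treats any callable stored as an attribute as a pending metadata flow.

```python
class LazyElement:
    """Workflow element whose attributes may hold pending metadata flows.

    A pending flow is modeled as a zero-argument callable (an assumption made
    for this sketch); it is evaluated the first time the attribute is read.
    """

    def __init__(self, **attributes):
        self.__dict__["_attrs"] = dict(attributes)

    def __getattr__(self, name):              # invoked only when normal lookup fails
        attrs = self.__dict__["_attrs"]
        if name not in attrs:
            raise AttributeError(name)
        value = attrs[name]
        if callable(value):                   # read trigger: reduce the flow now
            value = value()
            attrs[name] = value               # the flow is replaced by its value
        return value


calls = []


def input_from_generator():
    calls.append("read")                      # record each evaluation of the flow
    return generator.output_file


generator = LazyElement(output_file="events.ntpl")
simulator = LazyElement(input_file=input_from_generator)

first = simulator.input_file                  # triggers the lookup
second = simulator.input_file                 # already reduced, no second lookup
```

Nothing is evaluated when the simulator element is created, and chained flows resolve naturally because reading one attribute may trigger the read of another.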

4.3 MCRunjob 

MCRunjob [?] is a workflow management system based upon the RunJob
software for the CMS experiment. MCRunjob, however, is used for building jobs
to produce Monte Carlo data; it does not actually execute the jobs themselves.

4 This has been relaxed a bit in practice with wildcards. 

5 In cases where there are no dependencies to determine order, then the workflow elements 
appear in the order in which they were added to the system. 
Rather, MCRunjob executes the workflow of "configuring the jobs in the
execution workflow." This is nonetheless an ideal place to implement the ideas in this
paper, as they have to do with constraints on workflow configuration.
MCRunjob produces workflows for execution on DAGMan/Condor-G or on local batch
systems.

The simplified MCRunjob workflow in the example contains the following 
framework tasks. 

• contactDB: Contact the CMS Production Control Database 

• configureJob: Reduce constraints by referencing metadata targets 

• makeJob: Create and store application jobs into the Linker 

• runJob: Retrieve and submit stored jobs 

There is also an implicit reset framework task which is not shown.⁶ If the
workflow happens to contain a serialized DAG of user applications, then the
jobs are written out in the same serial order and synchronized by some external
means.

4.4 Practical Example 

In this section, we present a simple workflow graph G for the CMS experiment.
It is operated upon by a series of context documents C1, C2, and C3, resulting
in a final constrained but unreduced workflow graph Mq to be subsequently reduced by the
RunJob lazy reduction algorithm.

The example here is based upon a real example from Monte Carlo production
for the CMS experiment. It is not, however, an actual real-life example. The
real-life examples contain far too many confounding specializations that are
required in order to get actual Monte Carlo production done in the current
imperfect but evolving Grid infrastructure. Also, real CMS production does not
at this time constrain physics parameters by physics group. Rather, all actual
production parameters come from a production control database known as the
RefDB [23]. In one case, the syntax for the context block headers has been
simplified. Finally, many of the names of workflow elements and parameters
have been changed to be more illustrative and to use less CMS jargon.
The applications described were discussed in section 4.1.

Consider the unconstrained workflow G defined in section A.1. In this
workflow, a user has chosen to run the generator program CMKIN followed by the
simulation program OSCAR as well as the digitization program. The constraint
that the input data of one step be read from the output of the previous step is
expressed as a metadata flow. In MCRunjob, the workflow graph just specifies
the steps in the creation of a job, so a special "RunJob" workflow element
is added at the end to submit the jobs.

6 The details of how this workflow is executed efficiently to create N parallel jobs are
unimportant here. But briefly, there is a special group called the "onGroup" of framework calls
that is executed N times.





The first context file C1 is shown in section A.2 and defines the calls in
the RunJob framework, described in section 4.2.3. Calls are added to two
predefined groups: "preGroup", which is executed only once, and "onGroup", which
is executed as many times as there are jobs to be created.

The second context file C2 is a hypothetical context file written by a physics
group. (The corresponding real file used for CMS Monte Carlo production is
very long and unenlightening.) It exhibits the for-each and only-once semantics
described in section 3. For each added workflow node matching a type
declaration in the header of a context block, metadata flows are added and dependencies
are added. At the bottom of this context file, two metadata terminal nodes are
added once only. The RefDB is a metadata source corresponding to the CMS
production control database, and the PhysicsGroupDB is a metadata source
corresponding to a hypothetical database that could be set up by a physics
group to organize parameters that people in the group should use.

The third context file C3 is a scheduler context file. (The corresponding
real file used in CMS contains configurations for many different batch
systems.) This context file substitutes a concrete choice of a workflow element,
LCG_ResourceBroker, for an abstract choice given in the workflow definition,
RunJob.

The context files given in the appendices do not collide on metadata targets
because they have been organized on the basis of the node types that they
modify. No collision algorithm is needed. When the contexts are combined with the
workflow, the final constrained but unreduced workflow Mq results. It is given in section
[?]. As explained in section 4.2.4, RunJob reduces the unreduced
workflow into a fully constrained workflow graph (not shown) using a lazy algorithm
that triggers the evaluation of a constraint when its target is accessed. RunJob
does not implement more general Boolean constraints at this time.

Note that if the user had to specify all of the constraints in the final
unreduced workflow, it would be a very complex task. MCRunjob currently tracks
about 150 parameters for CMS Monte Carlo production, mostly coming from
the RefDB or describing local site conditions. As the system in CMS gets more
complex and involves more applications and users, the number of parameters
can only be expected to grow. By splitting the work into different context files,
the work can be shared among different roles: a naive user creates the initial
unconstrained workflow in section A.1, a developer maintains the framework
context file in section A.2, a physics group convener maintains the physics
context file in section A.3, and an administrator maintains the scheduler context file in
section A.4.

5 Conclusion and Relation to Other Work 

This paper outlined some considerations for constraint modeling in Grid
application workflows. We focused mainly on semantic constraints and not constraints
on physical resources or monitoring. We have shown how multigraphs and
metadata flows can be used to express constraints, and outlined a general procedure
for reducing multigraphs into fully constrained workflow graph descriptions.
Finally, we have shown how to factor out the constraints into contexts that can be
maintained separately and recombined later in a collaborative effort to constrain
a workflow.

5.1 Related Work 

In addition to being used in creating jobs in MCRunjob, contexts have been
demonstrated that govern the generation of fully constrained workflows as
ordered lists of shell scripts, Condor-G/DAGMan [?], or Chimera Virtual Data
Language (VDL) [12], all from the same simple workflow. The mechanism is to
use a context to select one or more code generators. DAGMan is a complete
workflow manager that has a very general model for specifying workflows, but it
relies completely on the user to set up both the DAG and the parameters in the
DAG itself. VDL presents a unique view of data processing as a network of data
products connected by application-induced transformations. While it is possible
for different collaborators to work independently on different transformations,
there are no real tools for allowing collaborators to independently specify
different aspects of the same chain of transformations. The context mechanism
presented here instead emphasizes the idea of virtual transformation as opposed
to virtual data.

Work on Context Oriented Programming [?] is being done for mobile
computing. Systems are being developed to exchange the actual code that gets
run in different locales. The present work is different in that it is effectively
exchanging constraints and not actual code, although contexts can be fooled
into loading different modules as shown in the example. Also, the definition of
"locale" here is generalized to be any category of relevance to the VO: physics
group, a personal role, et cetera.

Other workflow engines such as Triana m and Webflow m emphasize graphical
user interfaces and allow the user to visually determine where individual
elements are executed. Contexts are better suited to text-based input. Since
they are not graphs, they are more difficult to visualize.

An interesting related work is GridAnt ESI. The problem space of GridAnt
is similar: it tries to tame the application space of the Grid by formulating
the workflow as an Ant-style Makefile with additional procedural constructs.
However, it too does not facilitate collaborative workflow in the same way that
contexts do.

5.2 Comparison to Cascading Style Sheets 

A fresh perspective on the context mechanism of RunJob may be obtained by
considering Cascading Style Sheets (CSS) [21]. A CSS document is loaded into
a browser before processing HTML documents. The CSS document consists
of blocks with headers designed to match hierarchically defined segments of an
HTML document. When an HTML document is loaded, each HTML element is
processed according to configuration directives contained within a block of CSS
corresponding to the segment in which the HTML element appears. A RunJob
context is thus like CSS for workflow configuration. Though the context block
headers of RunJob do not appear to be hierarchical, they can be if one chooses
an ordered sequence of keys to use in workflow element descriptions.
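
The hierarchical reading of context block headers can be illustrated with a
minimal Python sketch. It is not RunJob code: the key ordering (Experiment,
Group, Application), the function names, and the rule that a more specific
block overrides a less specific one (borrowed from the CSS cascade) are all
assumptions made for illustration.

```python
# Hypothetical sketch: context-block headers matched against an ordered
# sequence of description keys, in the spirit of CSS selectors.
# KEY_ORDER, the key names, and the override rule are illustrative assumptions.

KEY_ORDER = ("Experiment", "Group", "Application")  # assumed key ordering


def header_matches(header, description):
    """Return True if every key in the block header agrees with the element."""
    return all(description.get(key) == value for key, value in header.items())


def specificity(header):
    """Depth of the deepest key named in the header (deeper = more specific)."""
    return max(KEY_ORDER.index(key) for key in header) + 1


def resolve(blocks, description):
    """Merge the settings of all matching blocks, most specific block last."""
    matching = [b for b in blocks if header_matches(b["header"], description)]
    matching.sort(key=lambda b: specificity(b["header"]))
    settings = {}
    for block in matching:
        settings.update(block["settings"])
    return settings


blocks = [
    {"header": {"Experiment": "CMS"},
     "settings": {"HCal": "On", "ECal": "On"}},
    {"header": {"Experiment": "CMS", "Application": "OSCAR"},
     "settings": {"ECal": "Off"}},
]
element = {"Experiment": "CMS", "Group": "Higgs", "Application": "OSCAR"}
print(resolve(blocks, element))
```

In this sketch the experiment-wide block supplies defaults and the
application-level block refines them, which is the sense in which an ordered
sequence of keys makes the headers hierarchical.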

5.3 Contextual Constraints and Provenance 

Provenance deals with the problem of collecting all of the information needed
in order to recreate a data product. The transformation graph approach of
Chimera is most useful here DS. Detailed provenance is therefore built into
the system. However, saving the metadata flows and metadata terminal nodes
before the process of reduction begins on a constrained workflow graph offers
the possibility of saving a new kind of provenance. Namely, in addition to
saving the flat values of all of the parameters that go into creating a data
product, one can also save the constraints and relationships among those
parameters and, by extension, why the parameters in a conventional provenance
have the values that they do: which calibration set is being used, whether it
is being used across all workflow steps, and who signed off on the set of
constraints as a whole rather than on each constraint one by one. This is
because the provenance as expressed in a workflow constraint mechanism is
categorical.
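
As a rough illustration of this kind of provenance, the following Python
sketch stores each parameter's unreduced metadata flow and originating context
next to its reduced value. The record layout, the function names, and the
placeholder values are assumptions for illustration only; they do not describe
an existing RunJob or Chimera format.

```python
import json

# Hypothetical provenance record: alongside the flat (reduced) parameter
# values it keeps the unreduced metadata flows and the context file that
# imposed each constraint. All field names are illustrative assumptions.


def make_provenance(constraints, resolved):
    """Combine constraints with their resolved values into one record."""
    record = {"parameters": {}, "constraints": []}
    for c in constraints:
        name = f"{c['element']}.{c['parameter']}"
        record["parameters"][name] = resolved[name]
        record["constraints"].append(dict(c))
    return record


def governed_by(record, context):
    """List the parameters whose values were imposed by a given context."""
    return sorted(
        f"{c['element']}.{c['parameter']}"
        for c in record["constraints"]
        if c["context"] == context
    )


constraints = [
    {"element": "CMKIN", "parameter": "HiggsMass",
     "flow": "::PhysicsGroupDB:HMass2004", "context": "PhysicsGroup.ctx"},
    {"element": "OSCAR", "parameter": "ECalThreshold",
     "flow": "::PhysicsGroupDB:ECalThreshold2004",
     "context": "PhysicsGroup.ctx"},
    {"element": "LCG_ResourceBroker", "parameter": "UserJDLFile",
     "flow": "::@args:UserJDLFile", "context": "Scheduler.ctx"},
]
# Placeholder strings stand in for the values read at reduction time.
resolved = {
    "CMKIN.HiggsMass": "<value of HMass2004>",
    "OSCAR.ECalThreshold": "<value of ECalThreshold2004>",
    "LCG_ResourceBroker.UserJDLFile": "<path given on the command line>",
}

record = make_provenance(constraints, resolved)
print(json.dumps(record, indent=2))
print(governed_by(record, "PhysicsGroup.ctx"))
```

A record of this shape answers not only what value a parameter had, but also
which context was responsible for it.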


A Example Workflow Specification Files 

The following sections contain code from CMS MCRunjob. The language is
called "macro language". Each statement in the macro language is translated
into specific calls to the RunJob API. In the future, the macro language will
be deprecated in favor of using the RunJob API directly and expressing
workflows and contexts in pure Python. The statements of the macro language
below are generally self-explanatory.

A.1 Unconstrained Workflow Definition

This unconstrained workflow creates workflow elements for three CMS
applications described in section 0 and links them together. The condition that one
application read input from the output of the previous step is expressed as a 
metadata flow. 

attach CMKIN
attach OSCAR
OSCAR adddep CMKIN
OSCAR define inputFile ::CMKIN:outputFile
attach Digitization
Digitization adddep OSCAR
Digitization define inputDataset ::OSCAR:outputDataset
Digitization define inputRunNumber ::OSCAR:outputRunNumber
attach RunJob
framework run
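
The reduction of metadata flows such as "::CMKIN:outputFile" can be sketched
in Python as follows. This is only an illustration of the idea, not the
MCRunjob implementation: the dictionary representation, the function name, and
the sample output values are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

FLOW_PREFIX = "::"  # a value of the form "::Source:name" is a metadata flow


def reduce_flows(elements, dependencies):
    """Replace each metadata flow by the value it points to.

    Elements are visited in dependency order so that a source element is
    fully reduced before any element that reads from it.
    """
    for name in TopologicalSorter(dependencies).static_order():
        params = elements[name]
        for key, value in list(params.items()):
            if isinstance(value, str) and value.startswith(FLOW_PREFIX):
                source, source_key = value[len(FLOW_PREFIX):].split(":", 1)
                params[key] = elements[source][source_key]
    return elements


# Sample output values are invented for illustration.
elements = {
    "CMKIN": {"outputFile": "cmkin_out.ntpl"},
    "OSCAR": {"inputFile": "::CMKIN:outputFile",
              "outputDataset": "sim_dataset",
              "outputRunNumber": 1},
    "Digitization": {"inputDataset": "::OSCAR:outputDataset",
                     "inputRunNumber": "::OSCAR:outputRunNumber"},
}
# Mirrors "OSCAR adddep CMKIN" and "Digitization adddep OSCAR".
dependencies = {"OSCAR": {"CMKIN"}, "Digitization": {"OSCAR"}}

reduce_flows(elements, dependencies)
print(elements["OSCAR"]["inputFile"])
print(elements["Digitization"]["inputDataset"])
```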


A.2 Framework.ctx 

The following context file simply defines the content and ordering of the
framework calls issued by the Linker. The "preGroup" is executed first exactly
once, and the "onGroup" is executed as many times as there are jobs to create.

framework define preGroup contactDB 

framework define onGroup configureJob,makeJob,runJob 
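
The ordering implied by this context can be sketched in Python. The parser and
the scheduling function below are hypothetical stand-ins for the Linker, which
is not shown in this paper; only the two statements themselves come from the
context file above.

```python
# Hypothetical sketch of how a Linker might order framework calls.
# parse_framework and schedule are illustrative names, not RunJob API calls.


def parse_framework(text):
    """Collect 'framework define <group> <call,call,...>' statements."""
    groups = {}
    for line in text.splitlines():
        words = line.split()
        if len(words) == 4 and words[:2] == ["framework", "define"]:
            groups[words[2]] = words[3].split(",")
    return groups


def schedule(groups, n_jobs):
    """preGroup calls run exactly once; onGroup calls repeat once per job."""
    calls = list(groups.get("preGroup", []))
    for _ in range(n_jobs):
        calls.extend(groups.get("onGroup", []))
    return calls


FRAMEWORK_CTX = """
framework define preGroup contactDB
framework define onGroup configureJob,makeJob,runJob
"""

groups = parse_framework(FRAMEWORK_CTX)
for call in schedule(groups, n_jobs=2):
    print(call)
```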

A.3 PhysicsGroup.ctx 

The following context file is a hypothetical physics group context. It requires
the addition of two metadata sources: the CMS production control database RefDB
and a hypothetical physics group database. Parameters for various applications
are constrained either directly in the context or by metadata flows from one or
the other database. Special directives are given to ensure that connections are
opened when the Linker issues the "contactDB" framework call. The description
keys are "Database" for the RefDB and the PhysicsGroupDB, and "Application"
for the CMS physics applications.

contextBlock Database=PhysicsGroupDB,RefDB 
oncall contactDB do connectToDatabase 

end 

contextBlock Application=CMKIN,OSCAR,Digitization
add dependency Database=PhysicsGroupDB 
add dependency Database=RefDB 

end 

contextBlock Application=CMKIN 
define ApplicationVersion 6.133 
define ApplicationName kine_make_ntuple.exe
define HiggsMass ::PhysicsGroupDB:HMass2004 
define TopMass ::PhysicsGroupDB:TMass2004 
end 

contextBlock Application=OSCAR
define ApplicationVersion OSCAR_3_6_5
define HCal On
define ECal On
define ECalThreshold ::PhysicsGroupDB:ECalThreshold2004
end

contextBlock Application=Digitization 
define ApplicationVersion ORCA_8_4_1
define PileupRate ::RefDB:Lumi_1032 
end 

attach PhysicsGroupDB
attach RefDB

A.4 Scheduler.ctx


The following context file defines the alias "RunJob" to mean
"LCG_ResourceBroker". When a novice user adds the generic "RunJob" element to a
workflow, he or she gets the LCG_ResourceBroker element when this context is
loaded. The context also contains configuration information for the
LCG_ResourceBroker. The tag "@args" means that the metadata flow comes from
command line arguments.

namespace add RunJob Scheduler=LCG_ResourceBroker 
contextBlock Scheduler=LCG_ResourceBroker 
define UserJDLFile ::@args:UserJDLFile
define ResourceBroker ::@args:ResourceBroker 
oncall RunJob do submit 
end 
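
The alias and the "@args" flows can be pictured with a short Python sketch.
The helper names, the dictionary standing in for parsed command line
arguments, and the sample argument values are assumptions; the real argument
handling of MCRunjob is not described here.

```python
# Hypothetical sketch of alias resolution and "@args" metadata flows.
# None of these helpers are RunJob API calls; sample values are invented.

ARGS_PREFIX = "::@args:"


def parse_alias(statement):
    """Split 'namespace add <alias> <Key>=<target>' into its three parts."""
    words = statement.split()
    if len(words) != 4 or words[:2] != ["namespace", "add"]:
        raise ValueError(f"not an alias statement: {statement!r}")
    key, _, target = words[3].partition("=")
    return words[2], key, target


def resolve_name(name, aliases):
    """Return the concrete element that a generic name stands for."""
    return aliases.get(name, name)


def resolve_args(value, args):
    """Replace an '::@args:<name>' flow by the command line value."""
    if value.startswith(ARGS_PREFIX):
        return args[value[len(ARGS_PREFIX):]]
    return value


alias, key, target = parse_alias(
    "namespace add RunJob Scheduler=LCG_ResourceBroker")
aliases = {alias: target}

# Stand-in for parsed command line arguments.
args = {"UserJDLFile": "user.jdl", "ResourceBroker": "rb.example.org"}

print(resolve_name("RunJob", aliases))
print(resolve_args("::@args:UserJDLFile", args))
```

With the context loaded, a workflow that attaches the generic RunJob element
is given the LCG_ResourceBroker element; a different scheduler context could
map the same generic name to another element without touching the workflow.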


A.5 Full Unconstrained Workflow Definition

The above context files are loaded into the system first. (Remember that
PhysicsGroup.ctx also adds two metadata sources at this time.) Upon addition
of each element from the workflow of section A.1, the metadata flows and
other directives from matching context blocks are added to that element. The
final unconstrained workflow appears below. The metadata flows are
reduced out as elements are accessed.

framework define preGroup contactDB
framework define onGroup configureJob,makeJob,runJob

attach PhysicsGroupDB
PhysicsGroupDB oncall contactDB do connectToDatabase

attach RefDB
RefDB oncall contactDB do connectToDatabase

attach CMKIN
CMKIN add dependency Class=PhysicsGroupDB
CMKIN namespace add PhysicsGroupDB Class=PhysicsGroupDB
CMKIN add dependency Class=RefDB
CMKIN namespace add RefDB Class=RefDB
CMKIN define ApplicationVersion 6.133
CMKIN define ApplicationName kine_make_ntuple.exe
CMKIN define HiggsMass ::PhysicsGroupDB:HMass2004
CMKIN define TopMass ::PhysicsGroupDB:TMass2004

attach OSCAR
OSCAR add dependency Class=PhysicsGroupDB
OSCAR namespace add PhysicsGroupDB Class=PhysicsGroupDB
OSCAR add dependency Class=RefDB
OSCAR namespace add RefDB Class=RefDB
OSCAR define ApplicationVersion OSCAR_3_6_5
OSCAR define HCal On
OSCAR define ECal On
OSCAR define ECalThreshold ::PhysicsGroupDB:ECalThreshold2004
OSCAR adddep CMKIN
OSCAR define inputFile ::CMKIN:outputFile

attach Digitization
Digitization add dependency Class=PhysicsGroupDB
Digitization namespace add PhysicsGroupDB Class=PhysicsGroupDB
Digitization add dependency Class=RefDB
Digitization namespace add RefDB Class=RefDB
Digitization define ApplicationVersion ORCA_8_4_1
Digitization define PileupRate ::RefDB:Lumi_1032
Digitization adddep OSCAR
Digitization define inputDataset ::OSCAR:outputDataset
Digitization define inputRunNumber ::OSCAR:outputRunNumber

attach LCG_ResourceBroker
LCG_ResourceBroker define UserJDLFile ::@args:UserJDLFile
LCG_ResourceBroker define ResourceBroker ::@args:ResourceBroker
LCG_ResourceBroker oncall RunJob do submit

framework run
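
The way directives from matching context blocks are added to an element as it
is attached can be approximated by the Python sketch below. It is a simplified
illustration, not the MCRunjob implementation: the parser and the attach
function are hypothetical, and the rewriting of "add dependency" directives
into "Class=" dependencies plus "namespace add" statements seen in the listing
is deliberately not modeled.

```python
# Hypothetical sketch: expanding context blocks into per-element statements.
# parse_context and attach are illustrative names, not RunJob API calls.


def parse_context(text):
    """Parse 'contextBlock Key=a,b ... end' into (key, names, directives)."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("contextBlock "):
            key, _, names = line.split(None, 1)[1].partition("=")
            current = (key, names.split(","), [])
        elif line == "end" and current is not None:
            blocks.append(current)
            current = None
        elif current is not None:
            current[2].append(line)
    return blocks


def attach(element, key, blocks):
    """Return the statements generated when an element is attached."""
    lines = [f"attach {element}"]
    for block_key, names, directives in blocks:
        if block_key == key and element in names:
            lines.extend(f"{element} {d}" for d in directives)
    return lines


CONTEXT = """
contextBlock Application=CMKIN,OSCAR,Digitization
add dependency Database=RefDB
end
contextBlock Application=CMKIN
define ApplicationVersion 6.133
define HiggsMass ::PhysicsGroupDB:HMass2004
end
"""

blocks = parse_context(CONTEXT)
for statement in attach("CMKIN", "Application", blocks):
    print(statement)
```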